COS 511 : Theoretical Machine Learning
Author
Abstract
Take the case of horse racing as an example: we can think of each expert as predicting the probability that each horse will win. Another way to motivate this model is through coding theory. Suppose Alice wants to send Bob a file consisting of messages x_1, …, x_t, …, x_T, where x_t ∈ X, the set of all possible messages. Alice could wait until she has seen all of the messages, which is an offline approach; instead, we want an online method. Suppose there are N coding methods (experts). We want to combine them into a compression algorithm that is as close to optimal as possible for our file. Note that we make no statistical assumptions about how the x_t are generated, i.e., they can be arbitrary. As we discussed when defining relative entropy, if p(x) is the probability of x chosen from X, then the optimal code length for message x, i.e., the number of bits, is −lg p(x). If compression algorithm (expert) i believes p_{t,i} is the distribution of x_t over X, then −lg p_{t,i}(x_t) is the number of bits used by method i to encode x_t. Similarly, our master algorithm uses −lg q_t(x_t) bits to encode x_t. The total number of bits used by method i is −∑_{t=1}^{T} lg p_{t,i}(x_t), and the master algorithm uses −∑_{t=1}^{T} lg q_t(x_t) bits in total.
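To make the bit-counting concrete, here is a minimal Python sketch (not from the notes themselves) that compares each expert's total code length with that of a master algorithm. It assumes, for illustration, that the master forms q_t as a weighted mixture of the experts' distributions and updates the weights in proportion to the probability each expert has assigned to the messages seen so far (a Bayes-mixture master); the variable names and toy data are hypothetical.

```python
import math

def total_bits(dists, messages):
    """Total code length, in bits: -sum_t lg p_t(x_t) for a single predictor."""
    return -sum(math.log2(p_t[x_t]) for p_t, x_t in zip(dists, messages))

def master_bits(expert_dists, messages):
    """Bits used by a master that mixes the experts' distributions.

    q_t(x) = sum_i w_t(i) * p_{t,i}(x), with weights w_t(i) proportional to
    the probability expert i has assigned to the messages seen so far (a
    Bayes-mixture master -- an illustrative choice, not necessarily the
    algorithm in the notes).
    """
    n = len(expert_dists[0])            # number of experts N
    w = [1.0 / n] * n                   # uniform prior weights w_1(i) = 1/N
    bits = 0.0
    for p_t, x_t in zip(expert_dists, messages):
        q = sum(w[i] * p_t[i][x_t] for i in range(n))  # q_t(x_t)
        bits -= math.log2(q)                           # -lg q_t(x_t) bits for x_t
        w = [w[i] * p_t[i][x_t] for i in range(n)]     # multiplicative (Bayes) update
        z = sum(w)
        w = [wi / z for wi in w]                       # renormalize
    return bits

# Toy run: X = {'a', 'b'}, T = 3 messages, N = 2 experts (all data hypothetical).
messages = ['a', 'b', 'a']
expert_dists = [  # expert_dists[t][i] is expert i's distribution over X at time t
    [{'a': 0.9, 'b': 0.1}, {'a': 0.5, 'b': 0.5}],
    [{'a': 0.9, 'b': 0.1}, {'a': 0.5, 'b': 0.5}],
    [{'a': 0.9, 'b': 0.1}, {'a': 0.5, 'b': 0.5}],
]
for i in range(2):
    print(f"expert {i}: {total_bits([p_t[i] for p_t in expert_dists], messages):.3f} bits")
print(f"master   : {master_bits(expert_dists, messages):.3f} bits")
```

With this mixture rule the master's total code length exceeds that of the best expert by at most lg N bits, which is the flavor of guarantee this online setting is after.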
Similar resources
Theoretical Machine Learning Cos 511 Lecture #9
In this lecture we consider a fundamental property of learning theory: it is amenable to boosting. Roughly speaking, boosting refers to the process of taking a set of rough “rules of thumb” and combining them into a more accurate predictor. Consider for example the problem of Optical Character Recognition (OCR) in its simplest form: given a set of bitmap images depicting hand-written postal-cod...
COS 511 : Theoretical Machine Learning
In other words, if ε ≤ 1/8 and δ ≤ 1/8, then PAC learning is not possible with fewer than d/2 examples. The outline of the proof is: to prove that there exist a concept c ∈ C and a distribution D, we are going to construct a fixed distribution D, but we do not know the exact target concept c used. Instead, we will choose c at random. If we get an expected probability of error over c, then there ...
COS 511 : Theoretical Machine Learning
Suppose we are given examples x1, x2, . . . , xm drawn from a probability distribution D over some discrete space X. In the end, our goal is to estimate D by finding a model that fits the data but is not too complex. As a first step, we need to be able to measure the quality of our model. This is where we introduce the notion of maximum likelihood. To motivate this notion, suppose D is distribu...
COS 511 : Theoretical Machine Learning
as the price relative, which is how much a stock goes up or down in a single day. S_t denotes the amount of wealth we have at the start of day t, and we assume S_1 = 1. We denote by w_t(i) the fraction of our wealth that we have in stock i at the beginning of day t, which can be viewed as a probability distribution since ∀i, w_t(i) ≥ 0 and ∑_i w_t(i) = 1. We can then derive the total wealth in stock i a...
COS 511 : Theoretical Machine Learning
Last class, we discussed an analogue of Occam's Razor for infinite hypothesis spaces that, in conjunction with VC-dimension, reduced the problem of finding a good PAC-learning algorithm to the problem of computing the VC-dimension of a given hypothesis space. Recall that VC-dimension is defined using the notion of a shattered set, i.e., a subset S of the domain such that Π_H(S) = 2^|S|. In this le...